About the project

The Introduction to Open Data Science (IODS) course will give us students an opportunity to get familiar with the R programming language and some open software tools (RStudio, R Markdown and GitHub). Later during the course, we will learn how these tools can be used in statistical data analysis.

Johanna’s GitHub repository

Week 1

The R programming language and the tools we are using during this course are all new to me. I have a little background in statistics, but I took those courses a long time ago, so I don’t remember much about the topic either. In addition, I am a new Mac user with very limited computer skills, so during this first week I have had some extra work solving beginner’s issues such as how to download and install the programs. Luckily there was a good amount of instructions available on the IODS course page, including videos, so everything should now be installed and working, or at least I hope so.


Week 2

In this week’s R exercise we perform regression analysis on questionnaire data about students’ learning approaches (deep, strategic and surface learning) and attitudes toward learning and the study subject. The scaled and combined answers are put together with final exam points, overall attitude toward statistics, gender and age.

Before the analysis, we did a data wrangling step (where the data table is converted into a different type of file for the analysis). In this step the data was already filtered so that observations with 0 exam points were left out, the deep, strategic and surface learning question points were averaged, and the attitude points were divided by the number of attitude questions.
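
As a reminder to myself, the wrangling described above could be sketched in base R roughly like this. The toy data and column names below are my own invention for illustration, not the actual survey file:

```r
# Toy stand-in for the raw questionnaire data (column names are hypothetical)
raw <- data.frame(D1 = c(4, 3), D2 = c(2, 5),    # two deep-learning questions
                  Attitude = c(30, 20),          # summed attitude score (10 questions)
                  Points = c(25, 0))             # exam points

raw$deep <- rowMeans(raw[, c("D1", "D2")])       # average the question group
raw$attitude <- raw$Attitude / 10                # divide by the number of questions
learning <- subset(raw, Points > 0)              # drop observations with 0 exam points
```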

Tables 1-3. Data tables summarizing the data and its structure. The data table has seven variables. The first variable is gender, which can be either female or male. The next one is age, which has an integer value, as do the exam points. The other points in the table are averaged values over the multiple questions on that specific topic.

# Reading the data into R (students2014) from the local data wrangling file (learning2014.txt). Setting header=TRUE because the file contains the names of the variables on its first line.

students2014<-read.table("~/Tiedostot_J/IODS-project/data/learning2014.txt", header=TRUE)

#Analysis of the data (structure and summary)


str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
head(students2014)
##   gender age attitude     deep  stra     surf points
## 1      F  53       37 3.583333 3.375 2.583333     25
## 2      M  55       31 2.916667 2.750 3.166667     12
## 3      F  49       25 3.500000 3.625 2.250000     24
## 4      M  53       35 3.500000 3.125 2.250000     10
## 5      M  49       37 3.666667 3.625 2.833333     22
## 6      F  38       38 4.750000 3.625 2.416667     21
summary(students2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :32.00   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :31.43   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :50.00   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Plot 1. To summarize, we have made a plot of attitude vs exam points. The observations are coloured by gender, and a regression line is drawn for each group to visualize whether there is a gender difference.

#Accessing libraries ggplot2 and GGally
library(ggplot2)
library(GGally)

#  Plot with ggplot (aes)
p1 <- ggplot(students2014, aes(x = attitude, y = points, col = gender))

#  point visualization
p2 <- p1 + geom_point()

# adding a regression line
p3 <- p2 + geom_smooth(method = "lm")

# adding a title and drawing the plot
p4 <- p3 + ggtitle("Student's attitude vs exam points")

# printing the plot
p4

Plot 2. A plot summarizing the variables in the data. This plot makes the different relationships between the variables easy to see.

# Drawing plot matrix with ggpairs()
p <- ggpairs(students2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))


# draw the plot
p

Table 4. Here an actual linear regression model with multiple explanatory variables is used. The exam points are the target of the analysis; the output gives us the model as well as details about its residuals. From the coefficients we can see where the fitted line crosses the y-axis (the intercept), and the model also estimates, for each variable, whether its effect differs significantly from zero; in other words, whether that variable makes a difference to the points (t and p values). Attitude seems to have the highest impact on exam points among these three variables.

The multiple R-squared of the model tells us how well the data fits the linear regression model: it is the proportion of the variance in exam points that the model explains. The more the observations deviate from the fitted values, the lower the R-squared, which would suggest that some other model is needed to explain our data. In our case there seems to be quite a lot of variability left in the data, meaning that exam points do not depend very strongly on attitude, age and strategic learning.

# creating a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + age + stra, data = students2014)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + age + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## attitude     0.34808    0.05622   6.191 4.72e-09 ***
## age         -0.08822    0.05302  -1.664   0.0981 .  
## stra         1.00371    0.53434   1.878   0.0621 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08
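
To make the R-squared interpretation above concrete, here is a small check of the formula R² = 1 - RSS/TSS, demonstrated on R’s built-in mtcars data (used instead of the course file, which is not loaded here):

```r
# R-squared is 1 - (residual sum of squares) / (total sum of squares)
m_demo <- lm(mpg ~ wt, data = mtcars)
rss <- sum(resid(m_demo)^2)                        # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)      # total sum of squares
r2_manual <- 1 - rss / tss
all.equal(r2_manual, summary(m_demo)$r.squared)    # TRUE
```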

Plot 3. The first diagnostic plot shows how well the residuals correspond to the fitted values of the model; the points should be scattered without any kind of trend. The second plot is a normal Q-Q plot, which checks the normality of the errors; again, the points should not drift away from the trend line in any direction. The third plot shows residuals versus leverage and takes into account that some observations may have a larger effect on the model fit than others; for example, extreme values can pull the fitted regression line more strongly than others.

# drawing diagnostic plots with plot(). Choosing plots 1, 2 and 5. par() makes room for multiple plots in one figure (four slots, only three used here)
par(mfrow=c(2,2))
plot(my_model, which=c(1:2,5))


Week 3

The data for this week’s analysis is from the UCI Machine Learning Repository: link to the data source.

The data has 33 different variables (questions). Many variables are binary, such as school, which each student answers as either GP (Gabriel Pereira) or MS (Mousinho da Silveira). Family size is also binary: LE3 is less than or equal to 3, GT3 is greater than 3. Some binary variables are simply answered yes or no. Other variables are answered with a number, such as age, which is an integer from 15 to 22. Some questions are answered nominally, such as mother’s job. At the end there are G1, G2 and G3, which are the numeric grades from periods 1 and 2 and the final grade of the student’s course subject, Math or Portuguese. There are many background questions and questions related to studying, but also questions about alcohol consumption on workdays and weekends.

The largest difference between the original data and the data table presented here (Table 1) is the two extra columns we have added after the grade variables, which concern alcohol use and high use of alcohol. The alcohol use variable combines the two original questions Dalc (workday alcohol consumption) and Walc (weekend alcohol consumption), both rated from 1 (low use) to 5 (high use), by taking their average. High use is true for students whose average alcohol use is greater than 2. So in our data table we have 35 variables instead of the original 33. The data has also been modified in the data wrangling: the two questionnaires have been merged into one, combining the data files and excluding possible duplicate answers from students who took both questionnaires.
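
A minimal sketch of how those two new columns are computed (toy values; the column names Dalc, Walc, alc_use and high_use follow the data set):

```r
# Two toy students: one light drinker, one heavy drinker
toy <- data.frame(Dalc = c(1, 4), Walc = c(2, 5))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2   # average of workday and weekend use
toy$high_use <- toy$alc_use > 2             # TRUE when the average exceeds 2
toy$high_use                                # FALSE TRUE
```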

Table 1. Variables in the combined data file.

alc<-read.table("~/Tiedostot_J/IODS-project/data/alc.txt", sep="", header=TRUE)

colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The purpose of our analysis is to study relationships between some variables in the data set and alcohol consumption. For this purpose, I have chosen four variables that could be affected by how much a student uses alcohol. Two of them, the quality of family relationships and weekly study time, should score higher when alcohol consumption is low; the other two, the number of school absences and going out with friends, should increase when alcohol consumption is high. So, I have chosen the following four variables for further analysis: famrel, studytime, goout and absences.

These variables will be analyzed against high alcohol use, so the first graph should show a lower family relationship score for students using high amounts of alcohol (TRUE in the graph) than for the other students (FALSE), and study time should likewise be lower in the TRUE group. In the two last figures it should be the other way around: for high alcohol consumers the box should be higher than for low consumers.

In data table 1 there is the number of students who consume higher amounts of alcohol (high alcohol consumption = TRUE) and of those who consume less (high alcohol consumption = FALSE). In data table 2 there are the mean values of family relationship quality in the high-consumption group (TRUE) and the low-consumption group (FALSE).

Boxplot 1. An effect of student’s alcohol consumption on family relationships.

#access to libraries tidyr, dplyr and ggplot2. Had to suppress the dplyr startup message.
suppressPackageStartupMessages(library(dplyr))
library(tidyr)
library(ggplot2)

# produce summary statistics by group. How many students are in the high alcohol consumption group?
alc %>% group_by(high_use) %>% summarise(count = n())
## # A tibble: 2 × 2
##   high_use count
##      <lgl> <int>
## 1    FALSE   268
## 2     TRUE   114
# What is the mean value of family relationship values in low consumer's group?
alc %>% group_by(high_use) %>% summarise(mean_grade=mean(famrel))
## # A tibble: 2 × 2
##   high_use mean_grade
##      <lgl>      <dbl>
## 1    FALSE   4.003731
## 2     TRUE   3.780702
# initialize a box plot of high_use and famrel, studytime, goout and absences.
g1<-ggplot(alc, aes(x=high_use, y=famrel))

# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("Quality of family relationships")+xlab("Consumes high amounts of alcohol")+ggtitle("Effect of student's alcohol consumption on his/her family relationships")

So it seems that alcohol consumption has, at most, a very small effect on the quality of family relationships. Those who did not use high amounts of alcohol rated their family relationships at 4.0 on average, while in the high-use group the average value was 3.8.

In data table 3 there are the mean values of weekly study time in the high-consumption group (TRUE) and the low-consumption group (FALSE).

Boxplot 2. An effect of student’s alcohol consumption on time used studying.

#access to libraries tidyr, dplyr and ggplot2.
library(tidyr); library(dplyr); library(ggplot2)

# What is the mean time used for studies in high/low consumer's group?
alc %>% group_by(high_use) %>% summarise(mean_grade=mean(studytime))
## # A tibble: 2 × 2
##   high_use mean_grade
##      <lgl>      <dbl>
## 1    FALSE   2.149254
## 2     TRUE   1.771930
g2<-ggplot(alc, aes(x=high_use, y=studytime))
g2 + geom_boxplot() + ylab("Weekly study time")+xlab("Consumes high amounts of alcohol")+ggtitle("Effect of student's alcohol consumption on his/her study time")

The average study time value in the group of low alcohol consumers was 2.1, while it was 1.8 in the high-consumption group. Again the effect was small, and in the suspected direction.

In data table 4 there are the mean values of time spent out with friends in the high-consumption group (TRUE) and the low-consumption group (FALSE).

Boxplot 3. An effect of student’s alcohol consumption on time spent out with friends.

#access to libraries tidyr, dplyr and ggplot2.
library(tidyr); library(dplyr); library(ggplot2)

# What is the mean of going out with friends in the high/low consumption groups?
alc %>% group_by(high_use) %>% summarise(mean_grade=mean(goout))
## # A tibble: 2 × 2
##   high_use mean_grade
##      <lgl>      <dbl>
## 1    FALSE   2.854478
## 2     TRUE   3.719298
g3<-ggplot(alc, aes(x=high_use, y=goout)) 
g3 + geom_boxplot() + ylab("Go out with friends")+xlab("Consumes high amounts of alcohol")+ggtitle("Effect of student's alcohol consumption on going out with friends")

Again the result follows the expected logic: those who consume more alcohol tend to spend more time out with their friends. So this time the TRUE group had higher points than the FALSE group (3.7 vs 2.9).

In data table 5 there are the mean numbers of absences from class for the same groups as previously (TRUE or FALSE on high alcohol consumption).

Boxplot 4. An effect of student’s alcohol consumption on times of absence from school.

#access to libraries tidyr, dplyr and ggplot2.
library(tidyr); library(dplyr); library(ggplot2)

# What is the mean number of absences in the high/low consumption groups?
alc %>% group_by(high_use) %>% summarise(mean_grade=mean(absences))
## # A tibble: 2 × 2
##   high_use mean_grade
##      <lgl>      <dbl>
## 1    FALSE   3.705224
## 2     TRUE   6.368421
g4<-ggplot(alc, aes(x=high_use, y=absences))
g4 + geom_boxplot() + ylab("Number of school absences")+xlab("Consumes high amounts of alcohol")+ggtitle("Effect of student's alcohol consumption on school absences")

So it seems that alcohol consumption had the highest impact on absences from school. The high-consumption group had been absent on average more than 6 times, whereas the other group was absent less than 4 times on average.

Then the relationships between the chosen variables and the binary high/low alcohol consumption variable are explored statistically with logistic regression.

Table 6. The summary of the fitted model, and Table 7. the coefficients of the model.

m<-glm(high_use~famrel+studytime+goout+absences, data=alc, family="binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ famrel + studytime + goout + absences, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8701  -0.7738  -0.5019   0.8042   2.5416  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.28606    0.70957  -1.812  0.06992 .  
## famrel      -0.33699    0.13681  -2.463  0.01377 *  
## studytime   -0.55089    0.16789  -3.281  0.00103 ** 
## goout        0.75953    0.12041   6.308 2.82e-10 ***
## absences     0.06753    0.02175   3.104  0.00191 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 384.07  on 377  degrees of freedom
## AIC: 394.07
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)      famrel   studytime       goout    absences 
## -1.28606058 -0.33699130 -0.55089391  0.75953025  0.06753071

The logistic regression coefficients also back up the previous analysis: the coefficients are positive for the last two variables (goout and absences) but negative for the first two (famrel and studytime).

Next the odds ratios and confidence intervals for the model are computed (Table 8).

#compute odds ratios and confidence intervals
OR<-coef(m)%>%exp
CI<-confint(m)%>%exp
## Waiting for profiling to be done...
#Print these out together
cbind(OR,CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.2763573 0.06723596 1.0961732
## famrel      0.7139151 0.54460646 0.9331198
## studytime   0.5764343 0.41040872 0.7941804
## goout       2.1372720 1.69853389 2.7261579
## absences    1.0698631 1.02591583 1.1187950

Finally we test the predictive power of the model with a 2x2 cross tabulation of the actual values versus the predictions.

#predict the probability of high_use
probabilities<-predict(m, type="response")

#add the predicted probabilities to alc
alc<-mutate(alc, probability=probabilities)

#use the probabilities to make a prediction of high use
alc<-mutate(alc, prediction=probability>0.5)

#Tabulate the target variable vs the predictions
table(high_use=alc$high_use, prediction=alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   242   26
##    TRUE     65   49
#Initialize a plot of high use vs probability in alc
g<-ggplot(alc, aes(x=probability, y=high_use, col=prediction))

#Define the geom as point and draw the plot
g+geom_point()

#Tabulate the target variable vs the predictions
table(high_use=alc$high_use, prediction=alc$prediction)%>% prop.table()%>%addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.63350785 0.06806283 0.70157068
##    TRUE  0.17015707 0.12827225 0.29842932
##    Sum   0.80366492 0.19633508 1.00000000

Lastly, the total proportion of inaccurately classified individuals (= the training error) is presented in the table below.

#define mean prediction error
loss_func<-function(class, prob){
  n_wrong<-abs(class-prob)>0.5
  mean(n_wrong)
}
#Call loss_func to compute the proportion of wrong predictions
loss_func(class=alc$high_use, prob=alc$probability)
## [1] 0.2382199

Week 4

In this week’s RStudio exercise we will use the Boston data set. It has 506 observations and 14 variables, so the dataset has 506 rows and 14 columns.

The columns in Boston data set are:

  1. crim: per capita crime rate by town
  2. zn: proportion of residential land zoned for lots over 25,000 sq.ft
  3. indus: proportion of non-retail business acres per town
  4. chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  5. nox: nitrogen oxides concentration (parts per 10 million)
  6. rm: average number of rooms per dwelling
  7. age: proportion of owner-occupied units built prior to 1940
  8. dis: weighted mean of distances to five Boston employment centres
  9. rad: index of accessibility to radial highways
  10. tax: full-value property-tax rate per $10,000
  11. ptratio: pupil-teacher ratio by town
  12. black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. lstat: lower status of the population (percent)
  14. medv: median value of owner-occupied homes in $1000s

# Load the Boston data from the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

# Explore the structure and dimension of the data
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The structure view in table above shows that Boston data contains only numerical values.

In the plot below, the linear correlations between the variables are roughly assessed with a scatterplot matrix. On the diagonal from top left to bottom right are the Boston data variables, and each variable is plotted against every other.

In the next data table there are summaries of the variables. Some of the variables have quite a large spread, such as crim and zn, while others vary much less, ptratio for example. The values range from 0 to 711, so they need to be scaled if we wish to handle them simultaneously in the analysis.

Below these, the correlations between the variables are examined more closely with a correlation matrix and a visualization of it. The visualized correlation plot shows nicely which variables are correlated and which are not.

# Access to dplyr and corrplot
library(dplyr)
library(corrplot)

# Scatterplot matrix of the variables
pairs(Boston)

# Summaries of the variables
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# For correlation, a "nicer" correlation matrix is made with corrplot
cor_matrix<-cor(Boston)

# Print the correlation matrix (values rounded, first 2 digits)
cor_matrix%>%round(2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
# Visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)

Next the data is standardized. The variables changed so that the values now range from about -3.9 to 9.9. The standardization is done in the scaling: the column mean is subtracted from each column and the resulting difference is divided by the standard deviation.
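
The formula can be checked by hand on a small vector (a sketch, not part of the analysis):

```r
# standardized value = (x - mean(x)) / sd(x); scale() does this per column
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)  # TRUE
```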

# Center and standardize variables
boston_scaled<-scale(Boston)

# Summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# boston_scaled is converted to data frame format
boston_scaled<-as.data.frame(boston_scaled)

# create a categorical variable of the crime rate 
# Create scaled_crim
scaled_crim<-boston_scaled$crim

# create a quantile vector of crim
bins<-quantile(scaled_crim)

# create the categorical variable ´crime´
crime<-cut(scaled_crim, breaks=bins, include.lowest=TRUE, labels=c("low", "med_low", "med_high", "high"))

# remove the original crim from the dataset
boston_scaled<-dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled<-data.frame(boston_scaled, crime)

# splitting the Boston data set to test and train sets
# number of rows in data set
n<-nrow(boston_scaled)

# choose randomly 80% of the rows
ind<-sample(n, size=n*0.8)

# create train and test sets
train<-boston_scaled[ind,]
test<-boston_scaled[-ind,]

A linear discriminant analysis is then fitted on the train set. The categorical crime rate is the target variable and all the other variables in the dataset are predictor variables.

# LDA on the train set, categorical crime rate as target variable, all other variables as predictor variables.
lda.fit<-lda(crime~., data=train)

# target class as numeric
classes<-as.numeric(train$crime)

# draw the LDA (bi)plot
plot(lda.fit, dimen=2, col=classes, pch=classes)

Next the crime categories are saved from the test data, and then the categorical crime variable is removed from the test dataset. After this, the classes are predicted with the LDA model and the results are cross tabulated against the crime categories from the test set. From the cross tabulation it seems that the high class is predicted most accurately, followed by low, while the two medium classes are not predicted as well.

# save the crime categories from the test data
correct_classes<-test$crime

# remove the categorical crime from test data
test<-dplyr::select(test, -crime)

# predict classes with the test data
lda.pred<-predict(lda.fit, newdata=test)

# cross tabulate the results
table(correct=correct_classes, predicted=lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       13      11        1    0
##   med_low    6      20        2    0
##   med_high   1       9       13    0
##   high       0       0        0   26
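
The overall prediction accuracy can be read off such a cross tabulation as the share of the diagonal. A sketch using the counts printed above (note that the train/test split is random, so these counts vary between runs):

```r
# confusion matrix copied from the output above (rows = correct, cols = predicted)
tab <- matrix(c(13, 11,  1,  0,
                 6, 20,  2,  0,
                 1,  9, 13,  0,
                 0,  0,  0, 26),
              nrow = 4, byrow = TRUE)
accuracy <- sum(diag(tab)) / sum(tab)   # correctly classified / all observations
round(accuracy, 2)                      # about 0.71
```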

Lastly, the distances between the observations are calculated and the k-means algorithm is run on the dataset.

# reload Boston data set
data(Boston)

# center and standardize variables
boston_scaled<-scale(Boston)

# calculate the distances between the scaled observations
dist_eu<-dist(boston_scaled)

# k-means clustering
km<-kmeans(dist_eu, centers=15)

# plot the dataset with clusters
pairs(Boston, col=km$cluster)

Based on the plot, the optimal number of clusters is 2, because that is where the total WCSS drops radically in the figure, and the algorithm is rerun with this value. The clusters are then visualized in a plot where they are separated by colour, and the results are interpreted.

# set.seed to get reproducible results every time
set.seed(123)

# calculate the distances between observations
dist_eu<-dist(boston_scaled)

# determine the number of clusters
k_max<-10

# calculate the total within-cluster sum of squares (WCSS) for each number of clusters
twcss<-sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
plot(1:k_max, twcss, type='b')

# k-means clustering with the chosen two clusters
km<-kmeans(boston_scaled, centers=2)

# plot the dataset with clusters
pairs(Boston, col=km$cluster)


Week 5

This week we have used the data from United Nations Development Programme, Human Development Reports (hdr.undp.org).

From the reports, the two data sets of Human Development Index (HDI) and Gender Inequality Index were downloaded, combined, and some data variables were selected for this week’s analysis exercise.

# Load the human.txt data into R.

human<-read.table("~/Desktop/IODS-project/data/human.txt", sep="", header=T)

# Explore the structure and the dimensions of the data 
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155   8

Our data set has 155 observations and 8 variables, i.e. 155 rows (different countries) and 8 columns. All variables are recognized as numerical; most have decimal values, but GNI and Mat.Mor are integers. The 8 chosen variables (column heading abbreviations) from the larger data set describe the following:

  1. The proportion of females with at least secondary education divided by the corresponding proportion of males (Edu2.FM)
  2. The proportion of females in the labour force divided by the corresponding proportion of males (Labo.FM)
  3. Expected years of schooling for children of school entering age (Edu.Exp)
  4. Life expectancy at birth (Life.Exp)
  5. Gross national income per capita (GNI)
  6. Maternal mortality ratio (Mat.Mor)
  7. Adolescent birth rate (Ado.Birth)
  8. Percentage of female representatives in parliament (Parli.F)
# Show a graphical overview of the data
# Access to ggplot2 and tidyr
library(ggplot2)
library(tidyr)

# For an easier overview of the data, the data set is scaled and the resulting matrix converted to a data frame.
human_scaled<-scale(human)
human_scaled<-as.data.frame(human_scaled)

# Next the scaled data is gathered into key-value pairs and plotted to visualize the distributions.
gather(human_scaled)%>%ggplot(aes(value))+facet_wrap("key", scales="free_y")+geom_histogram(bins=155, col="darkblue")+xlim(-6,6)

# To see how scaling affected the data, another set of histograms is made from the original data.
gather(human)%>%ggplot(aes(value))+facet_wrap("key", scales="free")+geom_histogram(bins=155, col="darkred")

From the blue histograms (scaled values) it is perhaps easier to see which variables are normally distributed and which are not: Ado.Birth, GNI and Mat.Mor do not follow a normal distribution. From the red histograms (unscaled values) it is obvious that the data set has no negative values, which makes sense, because none of the variables in the data table can be negative.
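As a side note on what scale() is doing here, a minimal sketch with a toy vector (the numbers are invented): standardizing is just subtracting the column mean and dividing by the column standard deviation.

```r
# Sketch: scale() standardizes each column to mean 0 and sd 1,
# i.e. (x - mean(x)) / sd(x); shown on a toy vector.
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
auto   <- as.numeric(scale(x))
all.equal(manual, auto)              # TRUE
round(c(mean(auto), sd(auto)), 10)   # 0 and 1
```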

The summary tables below are first for the original data and then for the scaled data. Together with the histograms above, they show that GNI has the largest spread of values and Life.Exp the smallest.

# Creating summary tables for both scaled and original human data
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
summary(human_scaled)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850

The relationships between variables can be visualized by creating a correlation plot. In the first correlation plot below, the distributions of the variables are again visible on the diagonal (from upper left to lower right). The highest correlation is between Life.Exp and Edu.Exp, and it is positive: in countries where expected years of schooling are high, life expectancy is also high. Mat.Mor and Life.Exp are negatively correlated, so where life expectancy is high, the maternal mortality ratio is low. The least correlated pair was Edu2.FM and Labo.FM, meaning that the female-to-male ratio in secondary education and the female-to-male ratio in the labour force were essentially uncorrelated. The same observations can be made from the second correlation plot below.

# Access GGally and corrplot
library(GGally)
library(corrplot)
# Visualize data variables with ggpairs()
ggpairs(human)

# Compute correlation matrix and visualize it with corrplot()
cor(human)%>%corrplot(order ="hclust", method="color", type="lower")

The non-standardized human data is then analyzed with PCA (principal component analysis). Almost all of the variability in this data comes from the first component (99.99%), and the second component (orthogonal to PC1) captures the remaining 0.01%; the other principal components are negligible in terms of captured variance. So we have two uncorrelated principal components that capture the highest variation. The arrows, however, all point in nearly the same direction, forming a small angle with the PC1 axis, so this plot tells us little about the correlations beyond the fact that all features correlate strongly with PC1. The length of an arrow represents the standard deviation of the feature.

# Perform principal component analysis on the non-standardized human data (with the SVD method)
pca_human<-prcomp(human)

# Create and print summary of pca_human
s<-summary(pca_human)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
# Rounded percentages of variance captured by each PC, print them out
pca_pr<-round(100*s$importance[2,], digits=2)
pca_pr
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 99.99  0.01  0.00  0.00  0.00  0.00  0.00  0.00
# Create object pc_lab to be used as axis labels
pc_lab<-paste0(names(pca_pr), "(", pca_pr, "%)")

# Draw a biplot (and suppress warnings about zero-length arrows)
suppressWarnings(biplot(pca_human, cex=c(0.8,1), col=c("grey40", "deeppink2"), xlab=pc_lab, ylab=NA, main="Biplot using not standardized human data"))
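The percentages above can also be computed by hand: the proportion of variance of a component is its squared standard deviation divided by the total variance. A minimal sketch on toy data (the data frame is invented):

```r
# Sketch: "Proportion of Variance" in a prcomp summary equals
# sdev^2 / sum(sdev^2); illustrated on a small invented data frame.
d <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c(1.2, 1.9, 3.1, 3.8, 5.2))
p <- prcomp(d)
prop_var <- p$sdev^2 / sum(p$sdev^2)
round(prop_var * 100, 2)   # percentage of variance captured by each PC
# agrees with the "Proportion of Variance" row of summary(p)$importance
```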

Next the PCA is done for the standardized variables, and now we get more meaningful principal components. Comparing the non-standardized and standardized PCAs makes it clear that PCA is sensitive to the scaling of the original values, since it assumes that features with larger variance are more important. In the standardized biplot, Labo.FM and Parli.F form their own group, as do Mat.Mor and Ado.Birth, while the remaining four variables are grouped on the left side of the biplot. This means that the maternal mortality ratio and the adolescent birth rate relate to similar features and are roughly opposite to high life expectancy at birth, expected years of schooling, gross national income per capita, and the female-to-male ratio in secondary education. In other words, thinking of countries, those on the left have higher education (also for girls), better population health, and more money for their people. On the right are the countries where more mothers die in childbirth and more underage girls become mothers. At the top are countries with a larger proportion of women in working life or in parliament, while at the bottom of the biplot are those where fewer women than men are in parliament and at work.

# Perform principal component analysis on the standardized human data (with the SVD method)
pca_human_scaled<-prcomp(human_scaled)

# Create and print summary of pca_human_scaled
s_s<-summary(pca_human_scaled)
s_s
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
# Rounded percentages of variance captured by each PC, print them out
pca_pr2<-round(100*s_s$importance[2,], digits=1)
pca_pr2
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# Create object pc_lab to be used as axis labels
pc_lab<-paste0(names(pca_pr2), "(", pca_pr2, "%)")

# Draw a biplot for standardized data 
biplot(pca_human_scaled, cex=c(0.8,1), col=c("grey40", "deeppink2"), xlab=pc_lab, ylab=NA, main="Biplot using standardized human data")

The biplot drawn after PCA on the standardized human data suggests that the first principal component (PC1) describes how developed a country is (from a health, economic and educational point of view), while PC2 mainly describes the role of women in the country (whether women stay at home or work, and whether they have a voice in the country's political system). Because the angle between the Parli.F and Labo.FM arrows and the PC1 axis is large, these variables are not correlated with how economically developed a country is. The arrows can be divided into three groups: within each group the angles between arrows are small (high correlation), while the angles between groups are large (low correlation). Within their group, Parli.F and Labo.FM have the largest angle between their arrows, meaning their relationship is weaker than those inside the other two groups. The arrow lengths are roughly the same, with Edu2.FM and GNI perhaps a bit shorter; these two also have the smallest angle, and thus the highest correlation, within their group. The first correlation plot suggested that the highest correlation would be between Edu.Exp and Life.Exp, which is not quite the same but close to what the biplot arrows show. Mat.Mor and Life.Exp again appear negatively correlated, as in the correlation plot, and Edu2.FM and Labo.FM have an angle very close to 90 degrees between their arrows, which makes them the least correlated features in this analysis.
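The angle reading used above can be made concrete with a small simulation (the variables a, b and z are invented for illustration): for standardized data, the cosine of the angle between two biplot arrows tracks the correlation of the corresponding variables, at least when the first two components capture most of the variance.

```r
# Sketch: cosine of the angle between two PCA loading arrows vs. the
# correlation of the variables; a, b, z are simulated toy variables.
set.seed(42)
a <- rnorm(200)
b <- 0.9 * a + rnorm(200, sd = 0.4)   # strongly correlated with a
z <- rnorm(200)                       # unrelated noise variable
p <- prcomp(data.frame(a, b, z), scale. = TRUE)
load2 <- p$rotation[, 1:2] %*% diag(p$sdev[1:2])   # arrow coordinates on PC1-PC2
cosang <- function(u, v) sum(u * v) / sqrt(sum(u^2) * sum(v^2))
round(c(cos_ab = cosang(load2["a", ], load2["b", ]), cor_ab = cor(a, b),
        cos_az = cosang(load2["a", ], load2["z", ]), cor_az = cor(a, z)), 2)
```

The cosine for the correlated pair comes out high, and for the unrelated pair close to zero, mirroring the Pearson correlations.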

The tea data set is loaded from the FactoMineR package. It contains a questionnaire on tea: 300 individuals were asked how they drink their tea (18 questions), about their perception of the product (12 questions) and about their personal details (4 questions). So there are 36 columns and 300 rows; the rows are the individuals who answered the questionnaire and the columns are the questions. The first 18 questions are active, the age question is a supplementary quantitative variable and the remaining questions are supplementary categorical variables.

# Access FactoMineR (includes the tea data) and dplyr (suppress startup messages)
suppressPackageStartupMessages(library(dplyr))
library(FactoMineR)

# The names of the datasets in the FactoMineR
data(package="FactoMineR")

# Load tea dataset from the package
data(tea)

# Summary and structure (questions) of the tea dataset
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255  
##  Not.feminine:171   sophisticated    :215   slimming   : 45  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##         exciting          relaxing              effect.on.health
##  exciting   :116   No.relaxing:113   effect on health   : 66    
##  No.exciting:184   relaxing   :187   No.effect on health:234    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
## 
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

Above is the structure of the tea data: it has 300 observations and 36 variables, as it should. The questions are the columns of the data set, and they are factors with 2 or more levels, meaning that the individuals had to choose an answer from the given options. The only exception is age, where the answer is an integer.

Let us first look at the first 18 questions, about how individuals have their tea; the first group of bar plots covers these. The next six plots show the personal questions together with age group and the frequency of drinking tea, and the last group of 12 bar plots concerns the perception of the product.

# Divide the data into groups for plotting (how tea is drunk, personal information, perception)
keep_columns1<-c("breakfast", "tea.time", "evening", "lunch","dinner","always","home","work","tearoom","friends","resto","pub","Tea","How","sugar","how","where","price")
keep_columns2<-c("age","sex", "SPC","Sport","age_Q","frequency")
keep_columns3<-c("escape.exoticism", "spirituality", "healthy", "diuretic", "friendliness", "iron.absorption", "feminine", "sophisticated", "slimming", "exciting", "relaxing", "effect.on.health")

tea_how<-dplyr::select(tea, one_of(keep_columns1))
tea_inf<-dplyr::select(tea, one_of(keep_columns2))
tea_per<-dplyr::select(tea, one_of(keep_columns3))

# visualize the how dataset (and suppress some warnings)
suppressWarnings(gather(tea_how)%>%ggplot(aes(value))+facet_wrap("key", nrow=3, scales="free")+geom_bar(fill="darkblue")+theme(axis.text.x=element_text(angle=45, hjust=1, size=7), axis.text.y=element_text(hjust=1, size=6), axis.title.x = element_blank(), axis.title.y=element_blank()))

suppressWarnings(gather(tea_inf)%>%ggplot(aes(value))+facet_wrap("key", scales="free")+geom_bar(fill="darkred")+theme(axis.text.x=element_text(angle=90, hjust=1, size=7), axis.text.y=element_text(hjust=1, size=6), axis.title.x = element_blank(), axis.title.y=element_blank()))

suppressWarnings(gather(tea_per)%>%ggplot(aes(value))+facet_wrap("key", scales="free")+geom_bar(fill="darkgreen")+theme(axis.text.x=element_text(angle=45, hjust=1, size=7), axis.text.y=element_text(hjust=1, size=6), axis.title.x = element_blank(), axis.title.y=element_blank()))

Then Multiple Correspondence Analysis (MCA) is done for the whole tea data. First there is the summary of the MCA, followed by two separate variable plots.

# MCA on the whole tea data: age (column 19) as a supplementary quantitative variable, columns 20-36 as supplementary categorical variables
mca_tea<-MCA(tea, quanti.sup=19, quali.sup=20:36, graph=F)

# Summary of the model
summary(mca_tea)
## 
## Call:
## MCA(X = tea, quanti.sup = 19, quali.sup = 20:36, graph = F) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.148   0.122   0.090   0.078   0.074   0.071
## % of var.              9.885   8.103   6.001   5.204   4.917   4.759
## Cumulative % of var.   9.885  17.988  23.989  29.192  34.109  38.868
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.068   0.065   0.062   0.059   0.057   0.054
## % of var.              4.522   4.355   4.123   3.902   3.805   3.628
## Cumulative % of var.  43.390  47.745  51.867  55.769  59.574  63.202
##                       Dim.13  Dim.14  Dim.15  Dim.16  Dim.17  Dim.18
## Variance               0.052   0.049   0.048   0.047   0.046   0.040
## % of var.              3.462   3.250   3.221   3.127   3.037   2.683
## Cumulative % of var.  66.664  69.914  73.135  76.262  79.298  81.982
##                       Dim.19  Dim.20  Dim.21  Dim.22  Dim.23  Dim.24
## Variance               0.038   0.037   0.036   0.035   0.031   0.029
## % of var.              2.541   2.438   2.378   2.323   2.055   1.915
## Cumulative % of var.  84.523  86.961  89.339  91.662  93.717  95.633
##                       Dim.25  Dim.26  Dim.27
## Variance               0.027   0.021   0.017
## % of var.              1.821   1.407   1.139
## Cumulative % of var.  97.454  98.861 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1             | -0.541  0.658  0.143 | -0.149  0.061  0.011 | -0.306
## 2             | -0.361  0.293  0.133 | -0.078  0.017  0.006 | -0.633
## 3             |  0.073  0.012  0.003 | -0.169  0.079  0.018 |  0.246
## 4             | -0.572  0.735  0.235 |  0.018  0.001  0.000 |  0.203
## 5             | -0.253  0.144  0.079 | -0.118  0.038  0.017 |  0.006
## 6             | -0.684  1.053  0.231 |  0.032  0.003  0.001 | -0.018
## 7             | -0.111  0.027  0.022 | -0.182  0.090  0.059 | -0.207
## 8             | -0.210  0.099  0.043 | -0.068  0.013  0.004 | -0.421
## 9             |  0.118  0.031  0.012 |  0.229  0.144  0.044 | -0.538
## 10            |  0.258  0.150  0.045 |  0.478  0.627  0.156 | -0.482
##                  ctr   cos2  
## 1              0.347  0.046 |
## 2              1.483  0.409 |
## 3              0.224  0.038 |
## 4              0.153  0.030 |
## 5              0.000  0.000 |
## 6              0.001  0.000 |
## 7              0.159  0.077 |
## 8              0.655  0.174 |
## 9              1.070  0.244 |
## 10             0.861  0.158 |
## 
## Categories (the 10 first)
##                  Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test
## breakfast     |  0.166  0.495  0.025  2.756 | -0.166  0.607  0.026 -2.764
## Not.breakfast | -0.153  0.457  0.025 -2.756 |  0.154  0.560  0.026  2.764
## Not.tea time  | -0.498  4.053  0.192 -7.578 |  0.093  0.174  0.007  1.423
## tea time      |  0.386  3.142  0.192  7.578 | -0.072  0.135  0.007 -1.423
## evening       |  0.319  1.307  0.053  3.985 | -0.058  0.053  0.002 -0.728
## Not.evening   | -0.167  0.683  0.053 -3.985 |  0.030  0.028  0.002  0.728
## lunch         |  0.659  2.385  0.075  4.722 | -0.390  1.018  0.026 -2.793
## Not.lunch     | -0.113  0.410  0.075 -4.722 |  0.067  0.175  0.026  2.793
## dinner        | -0.661  1.146  0.033 -3.136 |  0.796  2.025  0.048  3.774
## Not.dinner    |  0.050  0.086  0.033  3.136 | -0.060  0.152  0.048 -3.774
##                  Dim.3    ctr   cos2 v.test  
## breakfast     | -0.483  6.900  0.215 -8.017 |
## Not.breakfast |  0.445  6.369  0.215  8.017 |
## Not.tea time  |  0.265  1.886  0.054  4.027 |
## tea time      | -0.205  1.462  0.054 -4.027 |
## evening       |  0.451  4.312  0.106  5.640 |
## Not.evening   | -0.236  2.254  0.106 -5.640 |
## lunch         |  0.301  0.822  0.016  2.160 |
## Not.lunch     | -0.052  0.141  0.016 -2.160 |
## dinner        |  0.535  1.235  0.022  2.537 |
## Not.dinner    | -0.040  0.093  0.022 -2.537 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## breakfast     | 0.025 0.026 0.215 |
## tea.time      | 0.192 0.007 0.054 |
## evening       | 0.053 0.002 0.106 |
## lunch         | 0.075 0.026 0.016 |
## dinner        | 0.033 0.048 0.022 |
## always        | 0.045 0.001 0.101 |
## home          | 0.005 0.000 0.134 |
## work          | 0.112 0.043 0.005 |
## tearoom       | 0.372 0.022 0.008 |
## friends       | 0.243 0.015 0.103 |
## 
## Supplementary categories (the 10 first)
##                  Dim.1   cos2 v.test    Dim.2   cos2 v.test    Dim.3
## F             |  0.151  0.033  3.158 | -0.109  0.017 -2.278 | -0.048
## M             | -0.221  0.033 -3.158 |  0.159  0.017  2.278 |  0.070
## employee      | -0.153  0.006 -1.313 | -0.151  0.006 -1.289 |  0.103
## middle        | -0.030  0.000 -0.205 |  0.336  0.017  2.281 | -0.284
## non-worker    | -0.036  0.000 -0.324 |  0.185  0.009  1.666 | -0.291
## other worker  |  0.040  0.000  0.187 |  0.013  0.000  0.061 | -0.063
## senior        |  0.415  0.023  2.608 |  0.072  0.001  0.452 | -0.187
## student       |  0.032  0.000  0.305 | -0.317  0.031 -3.022 |  0.394
## workman       | -0.417  0.007 -1.473 |  0.249  0.003  0.878 |  0.343
## Not.sportsman | -0.030  0.001 -0.426 |  0.018  0.000  0.260 | -0.051
##                 cos2 v.test  
## F              0.003 -0.998 |
## M              0.003  0.998 |
## employee       0.003  0.884 |
## middle         0.012 -1.928 |
## non-worker     0.023 -2.620 |
## other worker   0.000 -0.289 |
## senior         0.005 -1.177 |
## student        0.047  3.760 |
## workman        0.005  1.209 |
## Not.sportsman  0.002 -0.721 |
## 
## Supplementary categorical variables (eta2)
##                    Dim.1 Dim.2 Dim.3  
## sex              | 0.033 0.017 0.003 |
## SPC              | 0.032 0.053 0.076 |
## Sport            | 0.001 0.000 0.002 |
## age_Q            | 0.008 0.077 0.146 |
## frequency        | 0.094 0.006 0.064 |
## escape.exoticism | 0.000 0.007 0.000 |
## spirituality     | 0.005 0.000 0.016 |
## healthy          | 0.000 0.000 0.008 |
## diuretic         | 0.004 0.000 0.013 |
## friendliness     | 0.071 0.001 0.013 |
## 
## Supplementary continuous variable
##                  Dim.1    Dim.2    Dim.3  
## age           |  0.042 |  0.204 | -0.340 |
# Visualize mca_tea
plot(mca_tea, invisible=c("ind"), cex=0.8, selectMod="contrib 20", unselect="grey30") 

# Another visualization of the mca_tea 
plot(mca_tea, choix="var", xlim=c(0,0.5), ylim=c(0,0.5), cex=0.7)

From the summary table one can read that dimension 1 explains approximately 10% of the variance and dimension 2 around 8%. The categorical variable values (eta2) are the squared correlation ratios between each variable and a dimension; a value close to 1 indicates a strong link between the dimension and the variable. According to the summary table, the strongest link (among the 10 categorical variables shown) is between tearoom and dimension 1. This is best visualized in the second plot, where tearoom stands on the far right along the x-axis.
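Instead of reading the printed summary, the eta2 values can also be pulled out of the MCA object directly. A sketch, assuming the mca_tea object from the chunk above is still in memory:

```r
# Sketch (assumes mca_tea from above): the squared correlation ratios of the
# active variables live in mca_tea$var$eta2; rank them on dimension 1.
eta2 <- mca_tea$var$eta2
head(eta2[order(eta2[, 1], decreasing = TRUE), ], 5)
```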

In the first MCA plot the individuals are not shown, only the variable categories. The x-axis is dimension 1 and the y-axis dimension 2. The 20 most contributing categories are written in red; the other categories are shown as grey triangles.

The second plot colors the variables: active variables in red, supplementary categorical variables in green and the supplementary continuous variable in blue. The coordinate of a categorical variable on a dimension is the squared correlation ratio between the dimension and the variable, whereas the coordinate of a continuous variable on a dimension is the squared correlation coefficient between the dimension and the variable. The distance between variables gives a measure of their similarity, so in the last plot it can be seen that resto and friends are closer than Tea and friends.
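That similarity reading can be sketched numerically as well (again assuming mca_tea from above; in the variables plot the coordinates of the active variables are their eta2 values, so plain Euclidean distances between those coordinates show which variables sit close together):

```r
# Sketch (assumes mca_tea from above): distances between the Tea, resto and
# friends variables in the Dim 1 / Dim 2 variables plot.
v <- mca_tea$var$eta2[c("Tea", "resto", "friends"), 1:2]
dist(v)   # a smaller value means a more similar position in the plot
```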